Abstract

We  consider  the  problems  of  class  probability  estimation  and  classification  when  using  near-neighbor  classifiers, 
such as k-nearest neighbors (kNN). This paper investigates minimum expected risk estimates for neighborhood learning
methods. We give analytic solutions for the minimum expected risk estimate for weighted kNN classifiers with
different  prior  information,  for  a  broad  class  of  risk  functions.  Theory  and  simulations  show  how  significant  the 
difference  is  compared  to  the  standard  maximum  likelihood  weighted  kNN  estimates.  Comparisons  are  made  with 
uniform  weights,  symmetric  weights  (tricube  kernel),  and  asymmetric  weights  (LIME  kernel).  Also,  it  is  shown  that 
if  the  uncertainty  in  the  class  probability  is  modeled  by  a  random  variable,  and  the  expected  misclassification  cost  is 
minimized,  the  result  is  equivalent  to  using  a  classifier  with  a  minimum  expected  risk  estimate.  For  symmetric  costs 
and  uniform  priors,  it  is  seen  that  minimum  expected  risk  estimates  have  no  advantage  over  the  standard  maximum 
likelihood  estimates.  For  asymmetric  costs,  simulations  show  that  the  differences  can  be  striking. 


1  Introduction 

We  consider  the  standard  supervised  statistical  learning  problem  of  classifying  test  samples  given  a  labeled  database 
of  training  examples,  a  known  finite  set  of  possible  class  labels,  and  a  matrix  of  misclassification  costs.  This  paper 
proposes  and  analyzes  robust  classification  and  estimation  for  near-neighbor  learning,  such  as  k-nearest  neighbors 
(kNN).  Near-neighbor  learning  is  sometimes  also  called  instance-based  learning,  memory-based  learning,  case-based 
reasoning,  or  lazy  learning. 

We  assume  that  a  set  of  training  sample  points  (neighbors),  and  weights  on  those  training  samples  have  been 
chosen  for  a  particular  test  point.  The  paper’s  focus  is  on  how  to  make  optimal  estimates  of  a  class  label  and  class 
probabilities  given  the  set  of  weighted  neighbors.  We  show  that  the  standard  estimation  procedure  is  a  maximum 
likelihood estimate. Maximum likelihood estimates can lack robustness. Laplace smoothing, or other smoothing, is
sometimes used heuristically in statistical learning. Such smoothing can be theoretically justified as Bayesian minimum
expected risk (MER) estimation. The two-class analytic MER solution for weighted near-neighbors was given in a
recent workshop paper [1]. In this paper we investigate more deeply: we give the multiclass solution, incorporate
different  prior  information,  and  show  that  the  same  classification  results  from  replacing  the  class  probability  estimate 
with  a  random  variable  and  directly  minimizing  the  expected  cost.  After  defining  notation  in  Section  2,  we  establish 
that  standard  near-neighbor  learning  uses  maximum  likelihood  estimates  in  Section  3.  MER  estimates  for  the  class 
probability  estimates  are  given  in  Section  4.  We  discuss  different  prior  information  scenarios  in  Section  5.  Classifying 
by  directly  minimizing  expected  misclassification  cost  is  proposed  and  solved  in  Section  6.  In  Section  7,  we  show  by 
theory  and  simulation  how  much  difference  the  MER  estimation  can  make.  Section  8  is  a  summary  and  discussion  of 
open questions. All the proofs of the mathematical results are given in the Appendix.

2  Notation


Supervised statistical learning is based on a set of given training pairs $T = \{(X_i, Y_i)\}$, where $X_i \in \mathbb{R}^d$ and
$Y_i \in \{1, 2, \ldots, Q\}$, where $Q$ is finite. The training samples $\{(X_i, Y_i)\}$ and test sample $(X, Y)$ are assumed to be drawn
independently and identically from some sufficiently nice joint distribution $P_{(X,Y)}$. One classification problem is to
estimate the probability of each class for a test feature vector, $P(Y = g \mid X = x)$ for $g \in \{1, 2, \ldots, Q\}$, based on the
given training pairs $T$. A related classification problem is to classify $X$ as label $\hat{Y}$ given $T$ and a $Q \times Q$ misclassification
cost matrix $C$, where $C(g, h)$ specifies the cost of classifying a test sample as class $g$ when the truth is class $h$.

It will be convenient to treat the unknown $P_{Y|x}$ as a random vector $\Theta$ with $Q$ random components $\Theta_g$, where each
$\Theta_g$ represents an unknown $P(Y = g \mid x)$ for $g \in \{1, \ldots, Q\}$. A particular realization of the random vector $\Theta$ will be
the probability mass function (pmf) $\theta$. The random vector $\Theta$ is distributed with density $f(\theta)$, which we constrain to
have a mathematically nice and relevant formulation (as stated in the following sections). The focus of this paper is on
robust classification estimates $\hat{Y}$, and robust estimates of the class pmf $\hat{\theta}$ with components $\hat{\theta}_g$ for $g \in \{1, \ldots, Q\}$.

Lastly,  we  note  that  in  this  paper  we  differentiate  between  near  neighbors  and  nearest  neighbors,  where  nearest 
neighbors  refers  to  the  closest  samples  from  a  training  set,  with  closest  measured  in  terms  of  Euclidean  distance  if  not 
otherwise  noted.  Near  neighbors  is  used  to  more  generally  mean  a  set  of  near  (but  not  necessarily  nearest)  neighbors. 

3  Standard  Near-neighbor  Learning 

There  are  many  approaches  to  supervised  statistical  learning;  we  focus  here  on  near-neighbor  methods.  Such  methods 
are known to perform well in practice, are intuitive, and can achieve optimal error rates asymptotically [2]. Nonparametric
neighborhood methods weight the training samples in a neighborhood around the test point $x$. It is not
important for this paper how one defines a neighborhood; common definitions are to use the $k$-nearest neighbors, or
all neighbors within a defined radius. Let the sample pairs be re-indexed by their distance to the test sample $x$, so that
$X_j$ is the $j$th nearest neighbor to $x$. Given a neighborhood, a weighted kNN classifier assigns a weight $w_j$ to each
neighbor, usually by evaluating a kernel that assigns weight based on the distance from $x$ to $X_j$ [2], though we will
also report results on a recent asymmetric kernel [3]. The kNN classifier assigns equal weights to every neighbor. Our
formulation will hold for any weighted kNN classifier where the weights satisfy $\sum_j w_j = 1$ and $w_j \ge 0$.

3.1  Standard  Estimates  for  Near-neighbor  Learning 

From the weights and the neighborhood sample pairs, it is standard to form an estimate of the probability of each class,

$$\hat{\theta}_g = \sum_{j=1}^{k} w_j I_{(Y_j = g)} \tag{1}$$

for $g \in \{1, \ldots, Q\}$, where $I_{(\cdot)}$ is an indicator function that equals one when its argument is true, and equals zero
otherwise. This standard formula for class probability estimates dates back at least to 1977 [4].

The class pmf estimate $\hat{\theta}$ for the test sample $x$ can be used to choose a class label $\hat{Y}$. Given a class pmf estimate
$\hat{\theta}$ and a misclassification cost matrix $C$, it is standard to classify $x$ as the class $\hat{Y}$ which minimizes the expected
misclassification cost with respect to the estimated pmf $\hat{\theta}$, such that $\hat{Y}$ solves

$$\arg\min_{g} \sum_{h=1}^{Q} C(g, h)\, \hat{\theta}_h. \tag{2}$$
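The two standard steps above, the estimate (1) followed by the decision rule (2), can be sketched in a few lines of Python. This is only an illustration: the function names, the three-neighbor example, and both cost matrices are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def ml_class_probabilities(labels: Sequence[int], weights: Sequence[float],
                           num_classes: int) -> List[float]:
    """Estimate (1): theta_hat[g-1] = sum_j w_j * I(Y_j == g), classes 1..Q."""
    if min(weights) < 0 or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be nonnegative and sum to one")
    theta_hat = [0.0] * num_classes
    for label, weight in zip(labels, weights):
        theta_hat[label - 1] += weight
    return theta_hat


def min_cost_class(theta_hat: Sequence[float],
                   cost: Sequence[Sequence[float]]) -> int:
    """Rule (2): the class g minimizing sum_h C(g, h) * theta_hat[h]."""
    expected = [sum(c * p for c, p in zip(row, theta_hat)) for row in cost]
    return 1 + min(range(len(expected)), key=expected.__getitem__)


# Hypothetical neighborhood: three neighbors, nearest first.
labels = [1, 2, 2]
weights = [0.4, 0.35, 0.25]
theta_hat = ml_class_probabilities(labels, weights, num_classes=2)

# cost[g-1][h-1] holds C(g, h): the cost of deciding g when the truth is h.
symmetric = [[0.0, 1.0], [1.0, 0.0]]
asymmetric = [[0.0, 1.0], [9.0, 0.0]]  # deciding class two for a true class one costs 9
label_symmetric = min_cost_class(theta_hat, symmetric)
label_asymmetric = min_cost_class(theta_hat, asymmetric)
```

With these numbers the estimate puts probability 0.4 on class one, so the symmetric costs select class two, while the asymmetric costs select class one.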

For uniform weights, the estimate (1) is the maximum likelihood estimate of $\Theta$ given the neighborhood data samples
$Y_1, Y_2, \ldots, Y_k$ of the test feature vector $x$ under the assumption that these near-neighbor samples were all drawn from
the same pmf. That is, if $m_g$ of the near-neighbors are of each class $g$ out of a total of $k$ near-neighbors, then
the estimate (1) is $\hat{\theta}_g = m_g/k$, which maximizes the likelihood $f(\theta)$ of independently identically drawing those
neighbors, where

$$f(\theta) \propto \prod_{g=1}^{Q} \theta_g^{m_g}. \tag{3}$$

Maximum likelihood (ML) estimates can be quite unrepresentative of the underlying likelihood distribution when
small sample sizes are used. Near-neighbor algorithms are often run with small neighborhood sizes, which yields
small sample sizes for the estimation done in (1), and thus we hypothesize that a different estimation principle could
make a difference in practice. See [5, pgs. 300-310] and [6, ch. 15] for further discussions of the problems with
maximum likelihood estimation.

Next,  we  define  a  weighted  likelihood  function  and  show  that  for  nonuniform  weight  vectors  w,  the  estimate  (1)  is 
similarly  the  maximum  weighted  likelihood  estimate,  and  can  be  expected  to  have  similar  problems  of  high  estimation 
variance  when  used  with  relatively  small  neighborhood  sizes  k. 

Let $f(\theta)$ be the likelihood of drawing the neighborhood samples with each neighborhood sample weighted by $w_j$:

$$f(\theta) = \prod_{j=1}^{k} \prod_{g=1}^{Q} \theta_g^{k w_j I_{(Y_j = g)}} = \prod_{g=1}^{Q} \theta_g^{k \sum_{j=1}^{k} w_j I_{(Y_j = g)}}, \tag{4}$$

where the multiplicative constant term of $f(\theta)$ has been dropped because it will not affect the minimization problems
we will solve to estimate $\theta$. Note that for kNN the weights are uniform, $w_j = 1/k$ for all $j$, and then the formulas
(4) and (3) are equivalent (up to a normalization constant).


Lemma 1 The class probability estimate given in (1) maximizes the weighted likelihood given in (4) subject to the
constraints

$$\sum_{g=1}^{Q} \theta_g = 1, \qquad \sum_{j=1}^{k} w_j = 1. \tag{5}$$
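Lemma 1 can be sanity-checked numerically. The sketch below evaluates the logarithm of the weighted likelihood (4) at the estimate (1) and at many pmfs drawn uniformly from the simplex. The neighborhood, the weights, the seed, and the helper names are all hypothetical, and a random search of this kind illustrates the lemma rather than proving it.

```python
import math
import random


def weighted_log_likelihood(theta, labels, weights):
    """Log of (4): sum_g (k * sum_j w_j I(Y_j = g)) * log(theta_g)."""
    k = len(labels)
    total = 0.0
    for g, theta_g in enumerate(theta, start=1):
        exponent = k * sum(w for y, w in zip(labels, weights) if y == g)
        if exponent > 0.0:
            total += exponent * math.log(theta_g)
    return total


def random_pmf(num_classes, rng):
    """A pmf drawn uniformly from the simplex (normalized exponentials)."""
    draws = [rng.expovariate(1.0) for _ in range(num_classes)]
    total = sum(draws)
    return [d / total for d in draws]


# Hypothetical neighborhood with four neighbors and three classes.
labels = [1, 1, 2, 3]
weights = [0.4, 0.3, 0.2, 0.1]
estimate = [sum(w for y, w in zip(labels, weights) if y == g) for g in (1, 2, 3)]

rng = random.Random(0)
best_random = max(
    weighted_log_likelihood(random_pmf(3, rng), labels, weights)
    for _ in range(5000)
)
at_estimate = weighted_log_likelihood(estimate, labels, weights)
```

Here `estimate` is approximately (0.7, 0.2, 0.1), and none of the randomly drawn pmfs attains a larger weighted log-likelihood than it does.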


4  Minimum  Expected  Risk  Estimates 

In  order  to  form  more  robust  estimates  of  the  class  label  and  the  class  pmf  for  a  test  sample,  we  propose  applying  to 
near-neighbor  learning  a  principle  of  estimation  more  robust  than  maximum  likelihood  estimation.  Minimizing  the 
expectation  of  a  relevant  risk  R  where  the  expectation  is  taken  over  all  possible  pmf’s  will  yield  robust  results  in  terms 
of  average  performance.  Minimizing  the  maximum  error  could  further  bound  the  possible  error,  but  at  the  expense  of 
suboptimal  estimates  on  average.  We  apply  a  Bayesian  minimum  expected  risk  (MER)  principle  [7,  ch.  4]  to  estimate 
the class probabilities. The MER estimate of the class conditional probability $\hat{\theta}_{MER}$ solves

$$\arg\min_{\hat{\theta}} \int R(\theta, \hat{\theta})\, f(\theta)\, d\theta, \tag{6}$$

where $f(\theta)$ is the probability of pmf $\theta$ being the true underlying pmf, and $R$ is some non-negative function (such as
mean-squared error, or relative entropy) suitable for measuring distortion.

The minimization problem (6) can be rewritten as

$$\arg\min_{\hat{\theta}} E_{\Theta}\left[ R(\Theta, \hat{\theta}) \right],$$

where $\Theta$ is distributed with probability density $f(\theta)$.

Let $f(\theta)$ be the likelihood of the weighted neighbor class labels as given in (4). This is equivalent to defining $f(\theta)$
to be the posterior with a uniform prior on the random variable $\Theta$. Theorem 1 establishes the analytic solution for
estimates for the class of Bregman divergence risk functions $R$ [8], which include the standard squared error loss and
relative entropy.

Definition of Bregman Divergence [9] Let $\psi \in C^2$, $\psi : \theta \in [0, 1]^Q \to \mathbb{R}$, be a strictly convex function. The Bregman
divergence is defined as

$$d_{\psi}(\theta, \phi) = \psi(\theta) - \psi(\phi) - (\theta - \phi)^T \nabla\psi(\phi). \tag{7}$$
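To make the definition concrete, the sketch below evaluates the standard Bregman divergence for two convex generators: the squared Euclidean norm, which yields squared error, and the negative entropy, which yields relative entropy when both arguments are pmfs. The function names and the two example pmfs are hypothetical.

```python
import math


def bregman(psi, grad_psi, theta, phi):
    """d_psi(theta, phi) = psi(theta) - psi(phi) - (theta - phi) . grad psi(phi)."""
    inner = sum((t - p) * g for t, p, g in zip(theta, phi, grad_psi(phi)))
    return psi(theta) - psi(phi) - inner


def squared_norm(x):
    return sum(v * v for v in x)


def squared_norm_grad(x):
    return [2.0 * v for v in x]


def negative_entropy(x):
    return sum(v * math.log(v) for v in x)


def negative_entropy_grad(x):
    return [math.log(v) + 1.0 for v in x]


# Two hypothetical pmfs with strictly positive components.
theta = [0.7, 0.2, 0.1]
phi = [0.5, 0.3, 0.2]

d_squared = bregman(squared_norm, squared_norm_grad, theta, phi)
d_entropy = bregman(negative_entropy, negative_entropy_grad, theta, phi)

# Closed forms the two divergences should reproduce.
squared_error = sum((t - p) ** 2 for t, p in zip(theta, phi))
relative_entropy = sum(t * math.log(t / p) for t, p in zip(theta, phi))
```

The first divergence matches `squared_error` and the second matches `relative_entropy`, the two risks named in the text.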


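Theorem 1 Let the probability of the multinomial pmf $\theta$ be of the form

$$f(\theta) = \gamma \prod_{g=1}^{Q} \theta_g^{\alpha_g},$$

where $\gamma$ is a normalization constant. The MER estimate $\hat{\theta}$ of the unknown pmf solves

$$\hat{\theta} = \arg\min_{\phi} \int_{\theta} R(\theta, \phi)\, f(\theta)\, d\theta,$$

where $\int_{\theta}$ denotes a multiple integral with region of integration such that $\sum_{g=1}^{Q} \theta_g = 1$, $\theta_g \ge 0$.

Then for any Bregman divergence risk $R(\theta, \phi) = d_{\psi}(\theta, \phi)$, the MER estimate gives the probability of the $g$th class,

$$\hat{\theta}_g = \frac{\alpha_g + 1}{\sum_{h=1}^{Q} \alpha_h + Q}, \qquad g = 1, \ldots, Q. \tag{8}$$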

In accordance with the above theorem, the MER estimate for uniform weights for either mean-squared error $R$ or relative
entropy $R$ is

$$\hat{\theta}_g = \frac{m_g + 1}{k + Q} \tag{9}$$

for $g \in \{1, 2, \ldots, Q\}$. This estimate is equivalent to Laplace correction for estimating multinomial distributions
[10, pg. 272], also called Laplace smoothing. Appropriately, the history of Laplace correction goes back to Laplace
himself; Jaynes offers historical information and more details about alternate derivations [11, pgs. 154-165]. Laplace
correction has been shown to be useful for class probability estimation in decision trees [12, 13, 14, 15], and with
Naive Bayes [16]. Laplace correction was incorporated in the CN2 rule learner [17], and Domingos used it to break
ties in a unified instance-based and rule-based learner [18]. Many different smoothing approaches are used in speech
recognition [19].

More generally, applying Theorem 1 to find the MER estimate for weighted neighbors yields

$$\hat{\theta}_g = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + 1}{k + Q} \tag{10}$$

for any Bregman divergence. More information on the Bregman divergences can be found in recent papers [20, 21].
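A minimal sketch of the weighted MER estimate (10), set beside the ML estimate (1) for comparison. The function name and the example neighborhood are hypothetical.

```python
def mer_class_probabilities(labels, weights, num_classes):
    """Estimate (10): (k * sum_j w_j I(Y_j = g) + 1) / (k + Q)."""
    k = len(labels)
    numerators = [1.0] * num_classes  # the "+ 1" contributed to every class
    for label, weight in zip(labels, weights):
        numerators[label - 1] += k * weight
    return [value / (k + num_classes) for value in numerators]


# Three neighbors, all of class one, uniform weights, two classes.
labels = [1, 1, 1]
weights = [1 / 3, 1 / 3, 1 / 3]
theta_mer = mer_class_probabilities(labels, weights, num_classes=2)
theta_ml = [sum(w for y, w in zip(labels, weights) if y == g) for g in (1, 2)]
```

For this neighborhood the ML estimate assigns probability one to class one, whereas the MER estimate is (0.8, 0.2), which is the Laplace-corrected value $(m_g + 1)/(k + Q)$ from (9).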


5  Prior Distributions on $\Theta$

Thus far a uniform prior distribution on $\Theta$ has been assumed, so that the posterior $f(\theta)$ is the likelihood of $\theta$, as per (4).
However, consider a two-class problem and uniform prior class probabilities $\{\pi_1 = .5, \pi_2 = .5\}$; there are still many
possible prior probabilities over the class pmf $\theta$. At one extreme, the prior is $q(\theta) = .5\,\delta([\theta_1, 1 - \theta_2]) + .5\,\delta([1 - \theta_1, \theta_2])$,
where we use the standard Dirac delta generalized function notation to express that the prior probability has half of
its support on the pmf $[0, 1]$, and half on the pmf $[1, 0]$. At the other extreme is the uniform prior $q(\theta) = 1/\sqrt{2}$,
which corresponds to every possible $\theta$ being equally likely. Both yield the same marginal class prior probabilities
$\{\pi_1 = .5, \pi_2 = .5\}$.

Better  performance  can  be  expected  with  a  prior  that  better  represents  the  prior  knowledge.  In  this  section  we 
consider some cases and approaches to prior information.

5.1  Zero Bayes’ Risk

Suppose the Bayes’ risk for a learning problem was known to be zero, so that the true class pmf $\theta$ for any test sample $x$
is $\theta_g = 1$ for some class $g$ and $\theta_h = 0$ for all other classes $h$. Then a uniform prior on $\Theta$ will yield grossly inaccurate
estimates, as the MER estimation takes into account the likelihood of a large set of $\theta$'s that will never occur in this
problem. (The likelihood of those non-possible $\theta$'s will sometimes be non-zero because the underlying assumption
that $x$’s neighbors have been drawn from the same $\theta$ as $x$’s class does not hold.) Using MER with the correct prior
information for this case is trivially equivalent to using ML.

5.2  Limited  Class  Probabilities 

Another interesting example will be covered in a simulation in Section 7.4. In that two-class simulation, $\theta_1 < .8$ for
all samples. Using a uniform prior for $\theta$ includes the likelihood of $\theta_1$ from .8 to 1, and this incorrect inclusion leads
to some inaccurate estimates. More generally, in a practical learning problem, one may know that there are no feature
vectors $x$ for which the probability of being class one is greater than some $a$. Here we give an analytic result for the
two-class case for this type of prior information.

Lemma 2 For a two-class classification problem, let the probability of $\theta$ be the product $f(\theta)q(\theta)$, where $q(\theta)$ is
uniform over the region $[0, a]$ for some $a < 1$, and $f(\theta)$ has the form

$$f(\theta) = \gamma \prod_{g=1}^{2} \theta_g^{\alpha_g},$$

where $\gamma$ is a normalization constant.

Then the MER estimate $\hat{\theta}$ of the unknown pmf solves

$$\hat{\theta} = \arg\min_{\phi} \int_{\theta} R(\theta, \phi)\, f(\theta)\, q(\theta)\, d\theta,$$

where $\int_{\theta}$ denotes an integral with region of integration such that $\theta_1 + \theta_2 = 1$, and $\theta_1 > 0$, $\theta_2 > 0$.
Then for any Bregman divergence risk, the MER estimate gives the probability of class one to be

$$\hat{\theta}_1 = \frac{B(a, \alpha_1 + 2, \alpha_2 + 1)}{B(a, \alpha_1 + 1, \alpha_2 + 1)}, \tag{11}$$

where $B$ is the incomplete beta function.

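Applying Lemma 2 to the case of interest, $\alpha_1 = k \sum_{j=1}^{k} w_j I_{(Y_j = 1)}$ and $\alpha_2 = k \sum_{j=1}^{k} w_j I_{(Y_j = 2)}$, the estimate
(11) is

$$\hat{\theta}_1 = \frac{B\left(a,\; k \sum_{j=1}^{k} w_j I_{(Y_j = 1)} + 2,\; k \sum_{j=1}^{k} w_j I_{(Y_j = 2)} + 1\right)}{B\left(a,\; k \sum_{j=1}^{k} w_j I_{(Y_j = 1)} + 1,\; k \sum_{j=1}^{k} w_j I_{(Y_j = 2)} + 1\right)}.$$

Of course, when $a = 1$, then (11) is equivalent to (8) for the two-class problem.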
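The estimate (11) is straightforward to evaluate numerically. The sketch below assumes that $B(a, p, q)$ denotes the unnormalized incomplete beta integral $\int_0^a t^{p-1}(1-t)^{q-1}\,dt$, and approximates it with a composite Simpson's rule so that no external library is needed. The function names, the number of panels, and the example values are hypothetical.

```python
def incomplete_beta(a, p, q, panels=2000):
    """Unnormalized incomplete beta: integral of t^(p-1) (1-t)^(q-1) over [0, a].

    Composite Simpson's rule; assumes p >= 1 and q >= 1 (bounded integrand)
    and an even number of panels.
    """
    def integrand(t):
        return t ** (p - 1) * (1.0 - t) ** (q - 1)

    step = a / panels
    total = integrand(0.0) + integrand(a)
    for i in range(1, panels):
        total += (4 if i % 2 else 2) * integrand(i * step)
    return total * step / 3.0


def mer_truncated_prior(alpha_1, alpha_2, a):
    """Estimate (11) for class one under a uniform prior on theta_1 in [0, a]."""
    return (incomplete_beta(a, alpha_1 + 2, alpha_2 + 1)
            / incomplete_beta(a, alpha_1 + 1, alpha_2 + 1))


# Three uniformly weighted neighbors, all of class one: alpha = (3, 0).
capped = mer_truncated_prior(3, 0, a=0.8)    # prior knowledge caps theta_1 at 0.8
uncapped = mer_truncated_prior(3, 0, a=1.0)  # a = 1 recovers the two-class form of (8)
```

With $\alpha = (3, 0)$ the uncapped value is $(3 + 1)/(3 + 0 + 2) = 0.8$, while the cap at $a = 0.8$ pulls the estimate down to $0.64$.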

In  practice,  one  may  or  may  not  have  an  idea  of  what  prior  information  to  encode.  Incorrect  prior  assumptions  can 
significantly  reduce  the  effectiveness  of  MER  estimation,  as  we  will  see  in  Subsection  7.2  on  simulations.  In  the  next 
subsections  we  discuss  how  to  use  global  likelihood  information  for  prior  information. 


5.3  Prior  Based  on  Global  Likelihood 

The estimated $\hat{\theta}$ is based only on a small set of data that are local enough to the test point $x$ to be considered relevant.
This local sample set may be too small and random to accurately communicate the true local $\theta$. The entire training set
of $n$ training samples is much larger, and may thus provide a more accurate assessment of the class probabilities $P_Y$
unconditioned on any test point $x$. Thus the entire sample set may be used to form a prior for the class probabilities $\theta$
that is more accurate than a uniform prior. Given no prior information, it may be useful to use the global likelihood of
the class labels as a prior over the class probabilities random variable $\Theta$. The global likelihood is

$$\prod_{g=1}^{Q} \theta_g^{r_g}, \tag{13}$$

where there are $r_g$ training samples of class $g$ out of $n$ total training samples. Because $n$ is relatively large compared
to $k$, this global likelihood distribution will have a relatively narrow peak. If such a peaked distribution is used as a
prior, no local likelihood will affect the estimate significantly, since the local likelihood is based on a subset of the
training samples and thus will be less peaked.

For  this  reason,  we  propose  using  the  global  likelihood  as  a  prior  but  giving  it  less  weight,  which  will  make  it  less 
narrowly peaked. Based on the global likelihood, define a prior function on $\theta$,

$$t(\theta) = \prod_{g=1}^{Q} \theta_g^{v r_g / n}, \tag{14}$$

where the unneeded normalization constants have been dropped and the variable $v$ acts as an artificial number of
sample points that the global likelihood prior is based on. For $v = n$, the global likelihood prior is the same as the
global likelihood (up to a constant). In practice, $v$ will be set much smaller so that the prior does not overwhelm the
local likelihood in the MER estimation. Using this prior (14), the local posterior $f(\theta)$ used for test sample $x$ is then
the posterior on $\theta$ formed by multiplying the weighted likelihood (4) by the global likelihood prior (14):

$$f(\theta) = \prod_{g=1}^{Q} \left( \theta_g^{k \sum_{j=1}^{k} w_j I_{(Y_j = g)}} \right) \left( \theta_g^{v r_g / n} \right).$$

It is useful to rewrite this as

$$f(\theta) = \prod_{g=1}^{Q} \theta_g^{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v r_g / n}.$$

Using the above $f(\theta)$ and applying Theorem 1, the MER estimate is

$$\hat{\theta}_g = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \left( \frac{r_g}{n} \right) + 1}{k + v + Q} \tag{15}$$

for either mean-squared error $R$ or relative entropy $R$, or other Bregman divergence risk.
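A small sketch of the estimate (15) shows how the pseudo-count $v$ trades the local neighborhood against the global class frequencies. The function name, the class counts, and the choice $v = 5$ are hypothetical illustrations, not values from the paper.

```python
def mer_with_global_prior(labels, weights, class_counts, v):
    """Estimate (15): (k * sum_j w_j I(Y_j = g) + v * r_g / n + 1) / (k + v + Q)."""
    k = len(labels)
    num_classes = len(class_counts)
    n = sum(class_counts)
    estimate = []
    for g in range(1, num_classes + 1):
        local = k * sum(w for y, w in zip(labels, weights) if y == g)
        global_term = v * class_counts[g - 1] / n
        estimate.append((local + global_term + 1.0) / (k + v + num_classes))
    return estimate


# Three neighbors, all of class one, but class one is globally rare.
labels = [1, 1, 1]
weights = [1 / 3, 1 / 3, 1 / 3]
class_counts = [100, 900]  # r_1 and r_2 out of n = 1000 training samples

no_global = mer_with_global_prior(labels, weights, class_counts, v=0)
with_global = mer_with_global_prior(labels, weights, class_counts, v=5)
```

Setting $v = 0$ recovers the uniform-prior estimate (10), here (0.8, 0.2). With $v = 5$ the global frequencies pull the estimate to (0.45, 0.55).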

The value of $v$ should represent how accurate and useful the practitioner thinks the global likelihood is as a prior
distribution, versus the uncertainty of using the local $k$ neighbors to estimate $\theta$. In practice, we propose using
cross-validation to train both $v$ and the number of neighbors $k$.

Using the empirical global likelihood still implies that, a priori to the empirical global likelihood information, there
was a uniform prior on $\Theta$. Established alternatives to this are the invariant prior and the concept of a data-translated
likelihood that leads to the noninformative prior [22] $\prod_{g=1}^{Q} \theta_g^{-1/2}$.

Incorporating  global  likelihood  information  was  reported  to  work  well  on  a  related  problem  for  rule  learning  [23]. 
That  approach  was  called  “m-estimates”  based  on  Cestnik’s  earlier  work  on  Naive  Bayes  [16].  Their  m-estimate  is  a 
weighted  average  of  the  rule’s  empirical  training  samples  and  the  global  training  samples.  Interpreting  a  probability 
estimate  as  a  weighted  average  of  the  empirical  distribution  and  some  more  general  distribution  dates  back  to  [24];  we 
discuss  Carnap’s  viewpoint  and  how  it  adds  intuition  to  our  presented  estimation  results  in  Subsection  5.6. 


5.4  Weighted  Global  Likelihood  Prior 

An extension of the proposed global likelihood prior given in (14) is to weight the entire training set to form a weighted
global likelihood prior $t(\theta)$ that is specific to the test point $x$. For example, one could weight the entire training set
based on each feature vector $X_i$'s distance to the test point $x$. Let this weighted prior be

$$t(\theta) = \prod_{g=1}^{Q} \theta_g^{v \sum_{i=1}^{n} u_i I_{(Y_i = g)}},$$

where constants have been dropped, $u_i$ is the weight placed on the $i$th training sample and may be a function of $x$, and
$v$ acts as the number of points drawn from the weighted global empirical class distribution.

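Multiplying the above weighted global likelihood prior by the local likelihood forms a posterior over the class pmf
$\Theta$:

$$f(\theta) = \prod_{g=1}^{Q} \theta_g^{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)}}. \tag{16}$$

Applying Theorem 1 to solve the MER estimation problem with the weighting $f(\theta)$ from (16) yields the estimate

$$\hat{\theta}_g = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)} + 1}{k + v \sum_{i=1}^{n} u_i + Q}, \tag{17}$$

whose denominator reduces to $k + v + Q$ when the weights $u_i$ sum to one.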

The $u_i$ weights can be used to “localize” the global likelihood by decaying the weight on points with distance to
the  test  point.  One  idea  would  be  to  only  weight  points  “somewhat”  local  to  the  test  point,  in  an  attempt  to  optimize 
a  trade-off  between  a  greater  number  of  samples  for  estimation  accuracy  and  sample  locality  to  the  test  point  for 
estimation  relevance. 


5.5  Maximum  A  Posteriori  (MAP)  Estimates 

The weighted global likelihood given in (16) can similarly be used to create a MAP estimate:

Lemma 3 The class pmf estimate

$$\hat{\theta}_g = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)}}{k + v \sum_{i=1}^{n} u_i} \tag{18}$$

is the maximum a posteriori estimate for the weighted global likelihood prior given in (16), subject to the constraints

$$\sum_{g=1}^{Q} \theta_g = 1, \qquad \sum_{j=1}^{k} w_j = 1.$$


Problems can occur with the MAP estimate when using standard non-uniform priors. For example, the well-established
noninformative prior $\prod_{g=1}^{Q} \theta_g^{-1/2}$ [22] leads, in the two-class case, to a MAP estimate of

$$\hat{\theta}_g = \frac{m_g - 1/2}{k - 1},$$

which for one near-neighbor will be infinite.
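One way to see where this estimate comes from is the following sketch. It assumes uniform neighbor weights, so that the likelihood is $\prod_g \theta_g^{m_g}$, combined with the prior $\prod_g \theta_g^{-1/2}$. The stationary point of the log-posterior under the constraint $\sum_g \theta_g = 1$ is then found with a Lagrange multiplier $\lambda$.

```latex
\begin{aligned}
\log\Big(\prod_{g=1}^{Q} \theta_g^{m_g}\,\theta_g^{-1/2}\Big)
  &= \sum_{g=1}^{Q} \big(m_g - \tfrac{1}{2}\big)\log\theta_g, \\
\frac{\partial}{\partial\theta_g}\Big[\sum_{h=1}^{Q} \big(m_h - \tfrac{1}{2}\big)\log\theta_h
  - \lambda\Big(\sum_{h=1}^{Q}\theta_h - 1\Big)\Big] = 0
  &\;\Longrightarrow\; \theta_g = \frac{m_g - \tfrac{1}{2}}{\lambda}, \\
\sum_{g=1}^{Q}\theta_g = 1
  &\;\Longrightarrow\; \lambda = k - \tfrac{Q}{2}, \qquad
  \hat{\theta}_g = \frac{m_g - \tfrac{1}{2}}{k - \tfrac{Q}{2}}.
\end{aligned}
```

For $Q = 2$ the denominator is $k - 1$, which vanishes at $k = 1$. With a single neighbor of class one ($m_1 = 1$, $m_2 = 0$) the two numerators are $+1/2$ and $-1/2$, so neither component is a usable probability.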

5.6  Carnapian  Interpretation 

The  proposed  MER  estimate  with  a  global  likelihood  prior  can  be  interpreted  within  an  estimation  framework  proposed 
by  Carnap  in  1952.  Although  Carnap’s  views  were  not  Bayesian,  he  proposed  a  general  continuum  of  induction 
rules  that  correspond  to  a  Bayesian  minimum  expected  risk  estimation  framework  using  a  range  of  different  prior 
information [11, pg. 279]. Carnap noted that there were two extremes to the multinomial estimation problem (Carnap
and Jaynes both gave binomial examples, but their logic extrapolates straightforwardly to the multinomial case). At the
one extreme is the empirical distribution $\hat{\theta}_g = m_g/k$. At the other extreme is what Carnap refers to as a logical factor,
which corresponds to an uninformed guess, such as the estimate $\hat{\theta}_g = 1/Q$. Carnap noted that experts in his time
agreed that the best estimate is somewhere between those two extreme estimates, and considered a convex weighting
of the two extreme estimates to form a continuum of inductive rules. Any point on Carnap’s continuum is seen to
correspond to a different prior in the Bayesian minimum expected risk framework.

The Laplace correction estimate (9) lies along this continuum, as pointed out by Carnap himself [24, pg. 35].
Rewrite (9) as

$$\hat{\theta}_g = \frac{k \left( \frac{m_g}{k} \right) + Q \left( \frac{1}{Q} \right)}{k + Q}. \tag{19}$$

The above re-expression of the Laplace correction estimate can be interpreted as the weighted average of $k$ points from
the empirical distribution $m_g/k$ for $g \in \{1, \ldots, Q\}$ and $Q$ points from the uniform distribution prior over the classes
[11, pg. 158].

Similarly, the MER estimate using a global likelihood weighted prior and uniform weights on the neighborhood
points ($w_j = 1/k$ for all $j$) can be re-expressed:

$$\hat{\theta}_g = \frac{k \left( \frac{m_g}{k} \right) + v \left( \frac{r_g}{n} \right) + Q \left( \frac{1}{Q} \right)}{k + v + Q}. \tag{20}$$

The above re-expression of the MER estimate can be interpreted as the weighted average of $k$ points from the local
empirical distribution $m_g/k$ for $g \in \{1, \ldots, Q\}$, $v$ points from the global empirical distribution $r_g/n$ for $g \in \{1, \ldots, Q\}$,
and $Q$ points from the uniform distribution over the classes.
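As a worked illustration of this weighted-average reading, take the hypothetical values $k = 3$, $m_g = 2$, $v = 2$, $r_g/n = 0.3$, and $Q = 2$. These numbers are chosen only for the arithmetic and do not come from the paper.

```latex
\hat{\theta}_g
  = \frac{3\left(\tfrac{2}{3}\right) + 2\,(0.3) + 2\left(\tfrac{1}{2}\right)}{3 + 2 + 2}
  = \underbrace{\tfrac{3}{7}}_{\text{local}}\cdot\tfrac{2}{3}
  + \underbrace{\tfrac{2}{7}}_{\text{global}}\cdot 0.3
  + \underbrace{\tfrac{2}{7}}_{\text{uniform}}\cdot\tfrac{1}{2}
  = \frac{3.6}{7} \approx 0.514 .
```

The local empirical frequency $2/3$ is pulled toward both the global frequency $0.3$ and the uniform value $1/2$, with mixing proportions $k$, $v$, and $Q$ out of $k + v + Q$.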


6  Minimizing  Expected  Cost 

Ideally, the estimated class would solve

$$\arg\min_{g} \sum_{h=1}^{Q} C(g, h)\, \theta_h, \tag{21}$$

where $\theta = (\theta_1, \ldots, \theta_Q)$ is the underlying probability density over the class labels $\{1, 2, \ldots, Q\}$. But in practice the
true class probability distribution $\theta$ is not known, so it is estimated by maximum likelihood (or as proposed in this
paper, minimum expected risk) based on the training set $T$. Then the formula (2) is used to classify $x$ as some $\hat{Y}$.

In this way, standard near-neighbor classification is a two-step process, where first a class pmf $\hat{\theta}$ is estimated, and
then the class with the minimum expected cost is estimated as in (2). In this section, we propose classifying in one
step.  The  one-step  estimated  class  will  minimize  the  expected  misclassification  cost,  where  the  expectation  is  over  the 
class  pmf  as  well  as  over  the  misclassification  cost  conditioned  on  the  class  pmf.  In  this  way,  one  directly  minimizes 
the  expected  misclassification  cost,  and  does  not  need  to  form  an  intermediate  class  pmf  estimate  with  respect  to  some 
risk  R  on  the  pmf  estimates.  This  also  avoids  the  question  of  which  risk  function  R  is  appropriate  to  use. 

As  analyzed  by  [25],  classification  is  robust  to  large  errors  in  the  class  pmf  estimate  if  the  errors  are  in  the  “right” 
direction.  Thus  one  might  expect  different  classification  results  by  directly  estimating  the  class  that  minimizes  the 
expected misclassification cost, since the intermediate step of estimating the class pmf $\theta$ is skipped.

We propose that in equation (21) the uncertainty about $\theta$ be modeled by a random vector $\Theta$. Then the class is
estimated by minimizing the expected value of $\sum_{h=1}^{Q} C(g, h)\, \Theta_h$ over the class labels, where the expectation is taken
with respect to random vector $\Theta$:

$$\arg\min_{g} E_{\Theta}\left[ \sum_{h=1}^{Q} C(g, h)\, \Theta_h \right], \tag{22}$$

where $\Theta = (\Theta_1, \Theta_2, \ldots, \Theta_Q) \in [0, 1]^Q$ is a random vector such that $\sum_{h=1}^{Q} \Theta_h = 1$.

The following corollary to Theorem 1 establishes that the result of this one-step classification is in fact equivalent
to the two-step classification given in (2) using the MER estimate if $f(\theta)$ is defined the same in both classification
approaches.

Corollary 1 Suppose that $P\{\Theta = \theta\} = f(\theta) = f(\theta_1, \theta_2, \ldots, \theta_Q) = \gamma \prod_{g=1}^{Q} \theta_g^{\alpha_g}$ (likelihood function or a posteriori
function). Let $C(g, h)$ be the cost of estimating class $g$ when the truth is class $h$. Then choosing a class label $g$ as per
(22) is equivalent to

$$\arg\min_{g} \sum_{h=1}^{Q} C(g, h)\, \frac{\alpha_h + 1}{\sum_{h'=1}^{Q} \alpha_{h'} + Q}.$$
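The corollary can be illustrated by simulation. A density proportional to $\prod_g \theta_g^{\alpha_g}$ on the simplex is a Dirichlet distribution with parameters $\alpha_g + 1$, which can be sampled by normalizing gamma variates. The sketch below compares a Monte Carlo estimate of the one-step expected costs (22) with the two-step costs computed from the MER estimate (8). The exponents, the cost matrix, the seed, and the sample size are hypothetical.

```python
import random


def sample_posterior(alpha, rng):
    """Draw theta with density proportional to prod_g theta_g^alpha_g."""
    draws = [rng.gammavariate(a + 1.0, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def one_step_costs(alpha, cost, num_samples, rng):
    """Monte Carlo estimate of E[sum_h C(g, h) Theta_h] for every class g."""
    totals = [0.0] * len(cost)
    for _ in range(num_samples):
        theta = sample_posterior(alpha, rng)
        for g, row in enumerate(cost):
            totals[g] += sum(c * t for c, t in zip(row, theta))
    return [t / num_samples for t in totals]


def two_step_costs(alpha, cost):
    """Expected costs under the MER estimate (alpha_h + 1) / (sum alpha + Q)."""
    denominator = sum(alpha) + len(alpha)
    theta_mer = [(a + 1.0) / denominator for a in alpha]
    return [sum(c * t for c, t in zip(row, theta_mer)) for row in cost]


alpha = [2.0, 1.0, 0.0]                   # hypothetical exponents, Q = 3
cost = [[0.0, 1.0, 1.0],                  # cost[g-1][h-1] holds C(g, h)
        [1.0, 0.0, 1.0],
        [5.0, 5.0, 0.0]]

one_step = one_step_costs(alpha, cost, num_samples=20000, rng=random.Random(0))
two_step = two_step_costs(alpha, cost)
best_one_step = 1 + min(range(3), key=one_step.__getitem__)
best_two_step = 1 + min(range(3), key=two_step.__getitem__)
```

Because the expected cost is linear in $\Theta$, the simulated one-step costs agree with the two-step costs up to Monte Carlo error, and both procedures select the same class.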


7  Is  the  Difference  Significant? 

The  MER  estimates  theoretically  minimize  the  expected  risk  (and  classification  error).  In  practice,  is  the  difference 
between  MER  and  ML  estimation  significant  for  near-neighbor  learning?  And  how  sensitive  is  the  MER  estimate  to 
the  assumed  prior?  We  turn  first  to  the  first  question.  In  the  simulations  of  Subsection  7.2,  we  investigate  the  second 
question. 

Given  zero  neighbors  of  class  one  out  of  two  neighbors,  or  zero  neighbors  of  class  one  out  of  one  thousand 
neighbors,  the  ML  estimate  is  the  same.  The  ML  estimate  is  only  a  function  of  the  ratio  of  each  class’s  neighbors  to 
the total number of neighbors $k$. The MER estimate can be written

$$\hat{\theta}_g = \frac{\frac{m_g}{k} + \frac{1}{k}}{1 + \frac{Q}{k}}.$$

From the above, it is seen that the MER estimate is a function of the ratio $m_g/k$, but is also a function of $k$. As $k$ grows
larger the MER estimate moves away from the uniform (or other prior), and closer to the neighbors’ empirical class
distribution. The smaller $k$ is, the larger the difference between the MER and ML estimates. In the limit of $k \to \infty$,
the ML and MER class probability estimates for a test point $x$ converge. However, kNN algorithms are often run for
small values of $k$, including $k = 1$. Also different from ML, the MER estimate is a function of the number of classes
$Q$, and for larger numbers of classes and the same number of neighbors $k$, the estimate is less trusting of the empirical
class distribution and closer to the prior.
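The dependence on $k$ is easy to tabulate. The sketch below evaluates both estimates for a class with no neighbors ($m_g = 0$) and $Q = 2$, using the uniform-weight formulas $m_g/k$ and $(m_g + 1)/(k + Q)$. The function names are hypothetical.

```python
def ml_estimate(m_g, k):
    """ML estimate for uniform weights: m_g / k."""
    return m_g / k


def mer_estimate(m_g, k, num_classes):
    """MER estimate written as (m_g/k + 1/k) / (1 + Q/k)."""
    return (m_g / k + 1 / k) / (1 + num_classes / k)


# A class with zero neighbors out of k, for two classes.
rows = [(k, ml_estimate(0, k), mer_estimate(0, k, 2)) for k in (1, 2, 1000)]
for k, ml, mer in rows:
    print(f"k={k:5d}  ML={ml:.4f}  MER={mer:.4f}")

# The same neighborhood (both of k = 2 neighbors in class g) with more classes.
two_classes = mer_estimate(2, 2, 2)
ten_classes = mer_estimate(2, 2, 10)
```

The ML estimate is 0 in all three rows, while the MER estimate falls from 1/3 at $k = 1$ to 1/4 at $k = 2$ and to roughly 0.001 at $k = 1000$. With both of two neighbors in class $g$, the MER estimate is 0.75 for two classes but 0.25 for ten classes.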

For  classification  with  two  classes,  the  class  label  is  determined  by  whether  the  class  probability  estimate  for  class 
one  is  above  or  below  a  threshold.  For  symmetric  misclassification  costs,  the  threshold  is  set  at  .5.  For  the  two- 
class  problem  with  class  one  and  class  two,  the  classification  threshold  t  is  theoretically  optimally  set  [2]  to  minimize 
expected  cost  at 

$$t = \frac{C(1,2)}{C(1,2) + C(2,1)}. \tag{23}$$

In  practical  learning  problems  such  as  computer-aided  diagnostics  of  medical  problems,  the  costs  can  be  extremely 
asymmetric. 
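As a small illustration of how asymmetric costs move the threshold, the sketch below computes $t$ and the resulting decision. It assumes that $C(g, h)$ is the cost of deciding class $g$ when the true class is $h$ (the convention used in Corollary 1) and that correct decisions cost nothing; the function names are hypothetical.

```python
def threshold(c12, c21):
    """Threshold on the class-one probability.

    Assumption: c12 = C(1, 2) is the cost of deciding class one when the
    truth is class two, c21 = C(2, 1) the reverse, and correct decisions
    cost nothing.
    """
    return c12 / (c12 + c21)

def classify(theta1, c12, c21):
    """Return 1 if the class-one probability estimate exceeds the threshold."""
    return 1 if theta1 > threshold(c12, c21) else 2

print(threshold(1, 1))        # symmetric costs
print(threshold(1, 9))        # missing class one is nine times as costly
print(classify(0.3, 1, 9))    # class one is chosen even though theta1 < 0.5
```

Under this convention, making a missed class-one case nine times as costly lowers the threshold from .5 to .1, so a class-one probability estimate of .3 already leads to a class-one decision.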

For  symmetric  classification  costs,  the  classification  decision  will  be  unchanged  given  a  MER  or  ML  estimate  of 
the  class  pmf,  as  stated  in  the  following  lemma  for  the  two-class  problem. 


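Lemma 4 Let $\phi$ be a classifier that classifies a given test sample $x$ as class one if $\hat{\theta}_{1,\mathrm{ML}} > \frac{1}{2}$ and as class two otherwise. Then

$$\phi(x) = \begin{cases} 1 & \text{if } \hat{\theta}_{1,\mathrm{MER}} > \frac{1}{2} \\ 2 & \text{otherwise.} \end{cases}$$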


In the simulations of Subsection 7.2 we see that the farther the threshold is from .5 (due to asymmetric misclassification costs), the larger the difference between the two estimates.
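The equivalence at a threshold of .5, and its failure away from .5, can be checked by brute force for unweighted two-class kNN. The sketch below assumes the ML estimate $m_1/k$ and the uniform-prior MER estimate $(m_1 + 1)/(k + 2)$; it is an illustrative check over small $k$, not a proof, and the helper names are hypothetical.

```python
from fractions import Fraction

def decisions(m1, k, t):
    """Return (ML decision, MER decision) for two classes and threshold t."""
    ml = Fraction(m1, k)
    mer = Fraction(m1 + 1, k + 2)   # uniform-prior MER estimate, Q = 2
    return (1 if ml > t else 2), (1 if mer > t else 2)

def disagreements(t, max_k=50):
    """All (m1, k) with k <= max_k where the ML and MER decisions differ."""
    return [(m1, k)
            for k in range(1, max_k + 1)
            for m1 in range(k + 1)
            if len(set(decisions(m1, k, t))) == 2]

print(len(disagreements(Fraction(1, 2))))   # symmetric costs
print(disagreements(Fraction(1, 5))[:3])    # asymmetric costs
```

Exact rational arithmetic is used so that ties at the threshold are not affected by floating-point rounding.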

7.1  Asymptotics 

Near-neighbor  classifiers  have  well-studied  asymptotic  behavior.  We  note  that  using  MER  estimation  instead  of  ML 
estimation  will  not  change  the  standard  asymptotic  near-neighbor  results. 
A supervised learning algorithm is $L_r$ consistent if, when $(X, Y), (X_1, Y_1), (X_2, Y_2), \ldots, (X_n, Y_n)$ are iid, $Y$ is real-valued, $r \ge 1$, and $E[|Y|^r] < \infty$, then $\hat{Y}(X) \to E[Y|X]$ in $L_r$. Using ML estimation, many near-neighbor classification methods are consistent under the standard assumptions that the near-neighbors are the $k$ nearest, and that $k \to \infty$ while the total number of samples $n \to \infty$ and $k/n \to 0$ [4, 6, 26].

Using uniform prior information, MER estimates as per (10) are trivially consistent for cases where maximum likelihood estimates are consistent, since the MER estimate converges to the ML estimate as $k \to \infty$.
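The convergence can be made explicit for unweighted neighbors. The bound below is a short worked step, not taken from the paper; it assumes the uniform-prior MER form $(m_g + 1)/(k + Q)$, where $m_g$ is the number of class-$g$ neighbors among the $k$ nearest.

```latex
\left| \hat{\theta}_{g,\mathrm{MER}} - \hat{\theta}_{g,\mathrm{ML}} \right|
  = \left| \frac{m_g + 1}{k + Q} - \frac{m_g}{k} \right|
  = \frac{\left| k - Q\, m_g \right|}{k\,(k + Q)}
  \le \frac{Q - 1}{k + Q}
  \xrightarrow[k \to \infty]{} 0 ,
\qquad 0 \le m_g \le k, \; Q \ge 2 .
```

The inequality uses $|k - Q m_g| \le (Q - 1)k$ for $0 \le m_g \le k$, so the gap between the two estimates shrinks at rate $1/k$ regardless of the neighbor labels.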

For finite $k$ and $n \to \infty$, asymptotic results relate the kNN error to the Bayes error [27, 6]. By Lemma 4, the MER results will be the same as the standard ML results in these cases.

7.2  Simulations 

To explore the difference between the MER and ML estimates, we present two simulations. Each simulation measures the misclassification costs in a two-class problem over a range of thresholds $t$, where $t$ is related to the misclassification costs as per (23), and $C(1,1) = C(2,2) = 0$. Over the range of $t = 0$ to $t = 1$, the cost matrix changes so that $C(1,2) + C(2,1) = 1$ is held constant.

There  are  many  nonparametric  neighborhood  classifiers;  see  [6,  2,  28]  for  reviews  and  discussion.  Results  are 
given  for  kNN,  for  the  symmetric  tricube  kernel  [2,  p.  168],  and  for  a  recent  asymmetric  kernel  method  that  uses  linear 
interpolation and the principle of maximum entropy (LIME) [29, 3, 26].

The tricube kernel is representative of the general class of positive, symmetric, smoothing kernels. Given a test sample $x$ and $k$ training samples $\{x_1, x_2, \ldots, x_k\}$, the tricube weight $w_j$ is

$$w_j = \frac{\left(1 - \|x - x_j\|_2^3\right)^3}{\sum_{i=1}^{k} \left(1 - \|x - x_i\|_2^3\right)^3}.$$
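A minimal sketch of normalized tricube weights follows. The text does not say how distances are scaled, so the `scale` parameter and its default (slightly more than the farthest neighbor's distance) are assumptions made here so that every distance lies in [0, 1]; the function name is hypothetical.

```python
import math

def tricube_weights(x, neighbors, scale=None):
    """Normalized tricube weights for the neighbors of test point x.

    Assumption: distances are divided by `scale` so they lie in [0, 1];
    by default `scale` is slightly larger than the farthest neighbor's
    distance, so that neighbor keeps a small positive weight.
    """
    dists = [math.dist(x, xj) for xj in neighbors]
    if scale is None:
        scale = 1.01 * max(dists) or 1.0
    raw = [max(0.0, 1.0 - (d / scale) ** 3) ** 3 for d in dists]
    total = sum(raw)
    return [r / total for r in raw]

w = tricube_weights((0.0, 0.0), [(0.1, 0.0), (0.0, 0.5), (1.0, 1.0)])
print(w)
```

The weights sum to one and decrease with distance, which is the property the weighted class probability estimates rely on.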

The LIME weights solve a minimization problem. Let $\mathcal{W}$ be the collection of all probability mass functions $w$, that is, all $n$-tuples with $\sum_i w_i = 1$ such that, for all $i = 1, \ldots, n$, $w_i \ge 0$ if $i \le k$ and $w_i = 0$ otherwise. Then the LIME weights $w^*$ solve

$$w^* = \arg\min_{w \in \mathcal{W}} \; D\left(\sum_{j=1}^{k} w_j x_j - x\right) - \lambda H(w), \tag{25}$$

where $D$ is some convex distortion function and $H(w)$ is the Shannon entropy. The first term of the LIME minimization attempts to find weights that solve the linear interpolation equations, making $x$ the center of its weighted neighbors $x_j$. This is directly related to reducing the first-order bias of the estimate. The second term of the LIME minimization attempts to maximize entropy, which keeps the variance of the estimate low. The LIME weights are defined in terms of a trade-off parameter $\lambda$. Although $\lambda$ can be trained using cross-validation, for these comparisons we set $\lambda$ to a default low value ($\lambda = 10^{-6}$). Squared $\ell_2$ distance is used for $D$, and the optimization of (25) is done with a fast primal-dual log-barrier interior-point method.
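To illustrate the two terms of the LIME objective, the sketch below only evaluates it for given weights; it does not implement the interior-point solver. It assumes the objective has the form $\|\sum_j w_j x_j - x\|_2^2 - \lambda H(w)$ with natural-log entropy, and the function and variable names are hypothetical.

```python
import math

def lime_objective(w, neighbors, x, lam=1e-6):
    """Squared l2 distortion minus lam times Shannon entropy (assumed form of (25))."""
    dim = len(x)
    center = [sum(wj * xj[d] for wj, xj in zip(w, neighbors)) for d in range(dim)]
    distortion = sum((c - xd) ** 2 for c, xd in zip(center, x))
    entropy = -sum(wj * math.log(wj) for wj in w if wj > 0)
    return distortion - lam * entropy

neighbors = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
x = (0.25, 0.25)
uniform = [1 / 3, 1 / 3, 1 / 3]
interpolating = [0.5, 0.25, 0.25]   # these weights reproduce x exactly

print(lime_objective(uniform, neighbors, x))
print(lime_objective(interpolating, neighbors, x))
```

With such a small trade-off parameter the distortion term dominates, so weights that place $x$ at the center of its weighted neighbors score better than uniform weights even though uniform weights have higher entropy.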

7.3  Unit  Square  Simulation 

For the unit square simulation, training samples and test samples are independently and identically drawn uniformly from a two-dimensional unit square. Each sample $X_i$ has a probabilistic class label based on the sum of its components: $P(Y = 2) = .5 X_i[1] + .5 X_i[2]$ and $P(Y = 1) = 1 - P(Y = 2)$. The left side of Figure 1 shows an example of 1000 sample points.
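The sampling scheme just described can be sketched in a few lines. The seed and the function names are arbitrary choices made here for repeatability and are not part of the paper.

```python
import random

def prob_class_two(x):
    """P(Y = 2 | x) = .5 x[1] + .5 x[2] for x in the unit square."""
    return 0.5 * x[0] + 0.5 * x[1]

def draw_unit_square(n, rng):
    """Draw n (point, label) pairs; labels are 1 or 2."""
    samples = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        y = 2 if rng.random() < prob_class_two(x) else 1
        samples.append((x, y))
    return samples

rng = random.Random(0)          # fixed seed, chosen only for repeatability
train = draw_unit_square(100, rng)
print(sum(1 for _, y in train if y == 2) / len(train))   # roughly one half
```

Points near the corner (1, 1) are almost always class two and points near (0, 0) almost always class one, with a smooth transition in between.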

Note that for this simulation the prior assumptions behind the MER estimate (10) hold because the true class probability $\theta$ is in fact uniform.

A set of 1000 test points was drawn, and then for each of 50 runs of the simulation, 100 different training points were drawn. The number of neighbors ranged from 1 to 10 nearest neighbors in terms of Euclidean distance. Results were averaged over the 1000 test points and the 50 sets of different training samples. Figure 2 (left side) shows the performance of 1NN (which is the same for all near-neighbor methods). The cost curves are piecewise linear because with 1NN ML the class probability estimates are either $\hat{\theta}_1 = 0$ or $\hat{\theta}_1 = 1$, and thus the classification errors are the same for $t < 1/2$ and for $t > 1/2$. Since the costs go as $t$, the same number of classification errors appears as a linear cost segment in the figure. For 1NN MER, the class probability estimates are either $\hat{\theta}_1 = 1/3$ or $\hat{\theta}_1 = 2/3$, and thus the classification errors are the same for $t < 1/3$, for $1/3 < t < 2/3$, and for $t > 2/3$.
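The breakpoints mentioned above are the values the class-one estimate can take, since a decision can only change where the threshold $t$ crosses one of them. A minimal sketch for unweighted neighbors, assuming the uniform-prior MER form $(m + 1)/(k + Q)$ and using hypothetical function names:

```python
from fractions import Fraction

def ml_values(k):
    """Class-one ML estimates attainable with k unweighted neighbors."""
    return [Fraction(m, k) for m in range(k + 1)]

def mer_values(k, Q=2):
    """Class-one uniform-prior MER estimates (m + 1) / (k + Q)."""
    return [Fraction(m + 1, k + Q) for m in range(k + 1)]

print(ml_values(1))    # decisions can only change where t crosses these values
print(mer_values(1))
```

For one neighbor the ML values are 0 and 1, while the MER values are 1/3 and 2/3, which is where the MER cost curve changes slope.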


Figure 1: (left) Example of 1000 training points from the unit square simulation. (right) Example of 200 training points from the Gaussian simulation.


Figure 2: (left) Results from the unit square simulation with one nearest neighbor. (right) Results from the Gaussian simulation with one nearest neighbor. (blue = ML; red = MER)


Figure 3 shows the performance using five nearest neighbors, and Figure 4 shows the performance using ten nearest neighbors. As predicted theoretically, the MER estimates perform better, the performance difference is generally greater for more disparate misclassification costs, and the performance difference shrinks as $k$ becomes larger. Results using 10,000 training samples for each of 50 runs were very similar to the results presented here, which used 100 training samples for each of 50 runs.
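A scaled-down version of this experiment can be sketched as follows. It is a single run with 200 test points rather than the 50 runs and 1000 test points reported above, it uses unweighted kNN only, and it assumes the cost convention $C(1,2) = t$, $C(2,1) = 1 - t$ together with the uniform-prior MER form $(m_1 + 1)/(k + 2)$; all names and the seed are hypothetical.

```python
import math
import random

def draw(n, rng):
    """Unit-square samples with P(Y = 2 | x) = .5 x[1] + .5 x[2]."""
    out = []
    for _ in range(n):
        x = (rng.random(), rng.random())
        out.append((x, 2 if rng.random() < 0.5 * (x[0] + x[1]) else 1))
    return out

def class_one_count(x, train, k):
    """Number of class-one labels among the k nearest training points."""
    nearest = sorted(train, key=lambda s: math.dist(x, s[0]))[:k]
    return sum(1 for _, y in nearest if y == 1)

def average_cost(test, counts, k, t, use_mer):
    """Mean cost with C(1,2) = t, C(2,1) = 1 - t (assumed convention)."""
    total = 0.0
    for (_, y), m1 in zip(test, counts):
        theta1 = (m1 + 1) / (k + 2) if use_mer else m1 / k
        decision = 1 if theta1 > t else 2
        if decision == 1 and y == 2:
            total += t
        elif decision == 2 and y == 1:
            total += 1 - t
    return total / len(test)

rng = random.Random(0)
train, test, k = draw(100, rng), draw(200, rng), 1
counts = [class_one_count(x, train, k) for x, _ in test]
for t in (0.1, 0.5, 0.9):
    print(t, average_cost(test, counts, k, t, False), average_cost(test, counts, k, t, True))
```

At $t = .5$ the two columns coincide, as Lemma 4 requires; at the asymmetric thresholds the 1NN MER classifier always picks the class that is cheaper to assign wrongly, which bounds its cost per sample by the smaller of the two costs.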

A  different  perspective  on  the  same  results  is  given  in  Figure  5,  where  the  ratio  of  the  misclassification  costs  is 
plotted  for  one  near-neighbor  and  five  near-neighbors.  Cost  ratios  for  very  asymmetric  costs  (t  <  .2  or  t  >  .8)  are 
striking. 

7.4  Gaussian  Simulation 

A popular simulation example is used based on Gaussians with the same center [30, 31, 2, 3]. Training and test samples are drawn iid in a two-dimensional Euclidean space, and are equally likely to be from class one or class two. Class one points are distributed as a Gaussian, $\mathcal{N}(0, \Sigma)$, where the covariance matrix $\Sigma$ is the $2 \times 2$ identity matrix. Class two points are similarly distributed as a Gaussian, $\mathcal{N}(0, 4\Sigma)$. A two-dimensional example of 200 sample points is given in the right side of Figure 1.
Figure 3: Results from the unit square simulation with five nearest neighbors. (blue = ML; red = MER)


Figure 4: Results from the unit square simulation with ten nearest neighbors. (blue = ML; red = MER)

Figure 5: Left: Cost ratio for the unit square simulation for one nearest neighbor. Right: Cost ratios for the unit square simulation for five nearest neighbors. (blue = kNN; red = tricube; green = LIME)


Figure 6: Results from the Gaussian simulation with five nearest neighbors. (blue = ML; red = MER; green = MER with prior; where the red line disappears, it overlaps the green line.)


For this simulation the prior assumption of a uniformly likely class pmf $\theta$ used to derive the MER estimate (10) does not hold. For one, the probability of class one, $\theta_1$, is never greater than .8 because the maximum of the ratio of the class one pdf to the sum of the class pdfs is .8:

$$\max \frac{\mathcal{N}(0, \Sigma)}{\mathcal{N}(0, \Sigma) + \mathcal{N}(0, 4\Sigma)}.$$
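The value .8 can be checked numerically. The sketch below evaluates the class-one posterior for equally likely classes with covariances $\Sigma = I$ and $4\Sigma$; the function names are hypothetical.

```python
import math

def gaussian_pdf(x, var):
    """Density of a 2-D zero-mean Gaussian with covariance var * I at x."""
    r2 = x[0] ** 2 + x[1] ** 2
    return math.exp(-r2 / (2 * var)) / (2 * math.pi * var)

def prob_class_one(x):
    """P(Y = 1 | x) for equally likely classes N(0, I) and N(0, 4I)."""
    p1, p2 = gaussian_pdf(x, 1.0), gaussian_pdf(x, 4.0)
    return p1 / (p1 + p2)

print(prob_class_one((0.0, 0.0)))   # the maximum, attained at the origin
print(prob_class_one((2.0, 0.0)))
```

The ratio equals $1 / (1 + \frac{1}{4} e^{3 r^2 / 8})$ with $r = \|x\|_2$, which decreases in $r$ and equals .8 at the origin.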

In practice, it might be difficult to know what an appropriate range for the prior likelihood of $\theta$ is. In order to investigate how much prior information matters, we compare ML, MER (with uniform prior on $\theta$), and MER with a prior that restricts $\theta_1 \le .8$, as per (11). Classifying based on one nearest neighbor is shown in Figure 2 (right side).
Figure 7: Results from the Gaussian simulation with ten nearest neighbors. (blue = ML; red = MER; green = MER with prior; where the red line disappears, it overlaps the green line.)


Classifying  from  five  nearest  neighbors  and  from  ten  nearest  neighbors  is  shown  in  Figure  6  and  Figure  7,  respectively. 
The additional prior information is seen to improve performance.

8  Discussion 

In this work we have investigated how minimum expected risk and minimum expected cost principles can improve average performance for near-neighbor classification methods. We have shown that for symmetric misclassification costs and a uniform prior, the classification decision is the same whether maximum likelihood or minimum expected risk estimates are used. However, the more asymmetric the misclassification costs, the greater the benefit to be gained from the minimum expected risk principle. Even in a simple two-dimensional simulation, the classification cost is as much as forty times smaller using the minimum expected risk estimate.

The  use  of  correct  prior  information  can  have  a  strong  effect  on  the  outcome,  and  we  have  discussed  a  few  different 
cases  for  prior  information.  On  the  one  hand,  the  practitioner  with  good  information  should  be  able  to  communicate 
that information to the learning algorithm, via the setting of this prior on the class pmfs. On the other hand, reasonably useful prior information may be difficult to obtain. The theoretical promise of the name “minimum expected risk estimation” delivers only to the extent that the prior information is useful.

This  research  leaves  open  some  questions.  One  is  how  to  derive  useful  prior  information  from  the  training  data.  In 
this  paper  we  showed  how  to  use  the  global  likelihood,  but  it  may  be  possible  to  estimate  other  prior  information.  A 
second  question  is  how  to  analytically  solve  for  estimates  given  other  prior  information.  For  the  case  of  limited  prior 
conditional probability of a class, such as $\theta_1 \le .8$, we gave results for the two-class case by using the incomplete beta
function.  Solving  for  more  classes  would  require  a  multiple  incomplete  beta  function,  of  which  the  authors  are  not 
aware.  Other  prior  information  might  not  lead  to  analytic  constraints,  and  the  integration  of  risk  might  require  Monte 
Carlo  sampling. 

Lastly,  this  paper  has  focused  on  minimizing  expected  misclassification  cost  for  near-neighbor  learning.  Many 
other  approaches  to  learning  use  ML  estimation  steps  which  could  be  replaced  by  MER  estimation.  Investigating  the 
application of MER estimation to other learning paradigms may be an interesting avenue of future research.
Appendix


In  this  appendix  we  prove  the  results  stated  in  the  paper  in  the  order  the  results  appear. 

Proof  of  Lemma  1.  The  proof  of  Lemma  1  is  the  same  as  the  proof  for  Lemma  3,  with  the  parameter  v  set  to  zero. 
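Proof of Theorem 1. The MER estimate $\hat{\theta}$ solves $\arg\min_{\phi} \rho(\phi)$, where

$$\rho(\phi) = E_{\Theta}[R(\Theta, \phi)] = E_{\Theta}[d_{\psi}(\Theta, \phi)] = \int_{\theta} d_{\psi}(\theta, \phi) f(\theta)\, d\theta. \tag{26}$$

Substituting the definition of Bregman divergence from equation (7), we write

$$\rho(\phi) = \int_{\theta} \left[ \psi(\theta) - \psi(\phi) - \nabla\psi(\phi)^{T} (\theta - \phi) \right] \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta. \tag{27}$$

Differentiating both sides with respect to $\phi$,

\begin{align}
\nabla\rho(\phi) &= \int_{\theta} \left[ -\nabla\psi(\phi) + \nabla\psi(\phi) - \nabla^2\psi(\phi)(\theta - \phi) \right] \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta \tag{28} \\
&= -\int_{\theta} \left[ \nabla^2\psi(\phi)(\theta - \phi) \right] \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta. \tag{29}
\end{align}

Setting the first-order optimality condition $\nabla\rho(\phi)$ equal to zero and solving for $\phi$,

$$\nabla\rho(\phi) = 0 \;\Rightarrow\; \nabla^2\psi(\phi) \int_{\theta} (\theta - \phi) \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta = 0. \tag{30}$$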


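Because of the strict convexity of $\psi$, $\nabla^2\psi(\phi)$ is a $Q \times Q$ positive definite Hessian matrix, and thus equation (30) implies that

$$\int_{\theta} (\theta - \phi) \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta = 0 \;\Rightarrow\; \phi = \frac{\int_{\theta} \theta \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta}{\int_{\theta} \prod_{g=1}^{Q} \theta_g^{\alpha_g}\, d\theta}. \tag{31}$$

Taking the $g$th element of vector equation (31),

$$\phi_g = \frac{\int_{\theta} \theta_g \prod_{j=1}^{Q} \theta_j^{\alpha_j}\, d\theta}{\int_{\theta} \prod_{j=1}^{Q} \theta_j^{\alpha_j}\, d\theta}. \tag{32}$$

Using Dirichlet's integral [32, pgs. 32-34], equation (32) can be written as

$$\phi_g = \frac{\left\{ \Gamma(\alpha_1 + 1) \cdots \Gamma(\alpha_{g-1} + 1)\, \Gamma(\alpha_g + 2)\, \Gamma(\alpha_{g+1} + 1) \cdots \Gamma(\alpha_Q + 1) \right\} / \Gamma\left(Q + 1 + \sum_{j=1}^{Q} \alpha_j\right)}{\left[ \Gamma(\alpha_1 + 1)\, \Gamma(\alpha_2 + 1) \cdots \Gamma(\alpha_Q + 1) \right] / \Gamma\left(Q + \sum_{j=1}^{Q} \alpha_j\right)},$$

which simplifies to

$$\phi_g = \frac{(\alpha_g + 1) \left[ \prod_{j=1}^{Q} \Gamma(\alpha_j + 1) \right] / \Gamma\left(Q + 1 + \sum_{j=1}^{Q} \alpha_j\right)}{\left[ \prod_{j=1}^{Q} \Gamma(\alpha_j + 1) \right] / \Gamma\left(Q + \sum_{j=1}^{Q} \alpha_j\right)},$$

and thus,

$$\hat{\theta}_g = \frac{\alpha_g + 1}{\sum_{j=1}^{Q} \alpha_j + Q}, \qquad g = 1, 2, \ldots, Q,$$

which is the estimate given in (8) of Theorem 1.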
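The simplification of the Gamma-function ratio can be verified numerically. The sketch below evaluates Dirichlet's integral in log form with `math.lgamma` and compares the ratio against the closed form $(\alpha_g + 1)/(\sum_j \alpha_j + Q)$; the example values of $\alpha$ and the function names are hypothetical.

```python
import math

def log_dirichlet_integral(alpha):
    """log of the simplex integral of prod_j theta_j**alpha_j (Dirichlet's integral)."""
    Q = len(alpha)
    return sum(math.lgamma(a + 1) for a in alpha) - math.lgamma(Q + sum(alpha))

def mer_from_integrals(alpha, g):
    """Ratio of the two integrals for class index g (0-based here)."""
    bumped = list(alpha)
    bumped[g] += 1          # the extra factor theta_g raises its exponent by one
    return math.exp(log_dirichlet_integral(bumped) - log_dirichlet_integral(alpha))

alpha = [3.0, 0.5, 1.5]     # hypothetical weighted neighbor counts per class
closed_form = [(a + 1) / (sum(alpha) + len(alpha)) for a in alpha]
numeric = [mer_from_integrals(alpha, g) for g in range(len(alpha))]
print(closed_form)
print(numeric)
```

Non-integer exponents are used on purpose, since weighted neighbors generally produce non-integer $\alpha_g$ and the Gamma-function identity still applies.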

An alternative proof uses a recent result [20, Theorem 1], which establishes that the minimizer of $E[d_{\psi}(R, s)]$ is the mean of $R$. Recognizing that $\phi_g$ in (32) is the mean of the random variable $\Theta_g$ with respect to the normalized pdf

$$\frac{f(\theta)}{\int_{\theta} f(\theta)\, d\theta},$$

it follows that $\phi_g$ is the sought minimizer. More recent results on minimizing the expected Bregman divergence are given in [21].

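Proof of Lemma 2. The first part of this proof follows the proof of Theorem 1, except that $f(\theta)$ in Theorem 1 is replaced by $f(\theta)q(\theta)$. Equation (32) becomes

$$\phi_1 = \frac{\int_{\theta_1 = 0}^{a} \theta_1 \prod_{g=1}^{2} \theta_g^{\alpha_g}\, d\theta}{\int_{\theta_1 = 0}^{a} \prod_{g=1}^{2} \theta_g^{\alpha_g}\, d\theta}. \tag{33}$$

The numerator and denominator may be written as standard incomplete Beta functions $B$. Thus (33) becomes

$$\phi_1 = \frac{B(a, \alpha_1 + 2, \alpha_2 + 1)}{B(a, \alpha_1 + 1, \alpha_2 + 1)}, \tag{34}$$

as stated in the lemma.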
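The ratio of incomplete Beta integrals can be approximated without special-function libraries. The sketch below uses a midpoint rule over $\theta_1 \in [0, a]$ with $\theta_2 = 1 - \theta_1$; it is a numerical stand-in written for illustration, with a hypothetical function name and step count, and not the paper's implementation.

```python
def restricted_mer(alpha1, alpha2, a, steps=20000):
    """Two-class MER estimate of theta_1 when the prior restricts theta_1 to [0, a].

    Midpoint-rule approximation of the ratio of incomplete Beta integrals;
    the integrand is theta_1**alpha1 * (1 - theta_1)**alpha2.
    """
    h = a / steps
    num = den = 0.0
    for i in range(steps):
        th = (i + 0.5) * h
        base = th ** alpha1 * (1.0 - th) ** alpha2
        num += th * base
        den += base
    return num / den

print(restricted_mer(1, 0, a=1.0))   # no restriction: (alpha1 + 1) / (alpha1 + alpha2 + 2)
print(restricted_mer(1, 0, a=0.8))   # restricted prior pulls the estimate down
```

For one neighbor of class one ($\alpha_1 = 1$, $\alpha_2 = 0$) the integrals have the closed form $2a/3$, so the estimate drops from 2/3 with no restriction to about .533 when $a = .8$.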


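Proof of Lemma 3. The Lagrangian $l(\theta)$ of the problem is

$$l(\theta) = \prod_{g=1}^{Q} \theta_g^{\left[ k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)} \right]} - \lambda \left( \sum_{g=1}^{Q} \theta_g - 1 \right),$$

where $\lambda$ is the Lagrange multiplier, with the conditions

$$\lambda \left( \sum_{g=1}^{Q} \theta_g - 1 \right) = 0, \tag{35}$$

$$\sum_{j=1}^{k} w_j = 1, \qquad \sum_{i=1}^{n} u_i = 1. \tag{36}$$

Use the first-order condition to solve for the optimal solution. Differentiating with respect to the variable $\theta_h$ and setting the result equal to zero, we have for $h = 1, 2, \ldots, Q$,

$$\frac{\left[ k \sum_{j=1}^{k} w_j I_{(Y_j = h)} + v \sum_{i=1}^{n} u_i I_{(Y_i = h)} \right]}{\theta_h} \prod_{g=1}^{Q} \theta_g^{\left[ k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)} \right]} = \lambda. \tag{37}$$

Taking any two of the $Q$ equations of (37) and eliminating $\lambda$ yields

$$\frac{k \sum_{j=1}^{k} w_j I_{(Y_j = h)} + v \sum_{i=1}^{n} u_i I_{(Y_i = h)}}{\theta_h} = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)}}{\theta_g},$$

which, solving for $\theta_h$, gives

$$\theta_h = \frac{k \sum_{j=1}^{k} w_j I_{(Y_j = h)} + v \sum_{i=1}^{n} u_i I_{(Y_i = h)}}{k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)}}\, \theta_g, \qquad h = 1, \ldots, Q.$$

Then, using the constraint (35) on $\theta$, gives

$$\theta_g \sum_{h=1}^{Q} \left[ k \sum_{j=1}^{k} w_j I_{(Y_j = h)} + v \sum_{i=1}^{n} u_i I_{(Y_i = h)} \right] = k \sum_{j=1}^{k} w_j I_{(Y_j = g)} + v \sum_{i=1}^{n} u_i I_{(Y_i = g)}.$$

Since $\sum_{h=1}^{Q} \left[ k \sum_{j=1}^{k} w_j I_{(Y_j = h)} + v \sum_{i=1}^{n} u_i I_{(Y_i = h)} \right] = k + v$ by (36), the MAP estimate $\hat{\theta}_g$ is as stated in the lemma.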
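The resulting estimate can be sketched directly from the last two steps of the proof: the weighted local counts (scaled by $k$) and the weighted global counts (scaled by $v$) are added and divided by $k + v$. The function name, argument layout, and example data below are hypothetical.

```python
def map_estimate(local_weights, local_labels, global_weights, global_labels, v, Q):
    """MAP class pmf: (k * local weighted counts + v * global weighted counts) / (k + v).

    Assumptions: local_weights (w_j) and global_weights (u_i) each sum to one,
    labels are integers 1..Q, and v >= 0 sets how strongly the global class
    frequencies act as a prior.
    """
    k = len(local_weights)
    theta = []
    for g in range(1, Q + 1):
        local = sum(w for w, y in zip(local_weights, local_labels) if y == g)
        glob = sum(u for u, y in zip(global_weights, global_labels) if y == g)
        theta.append((k * local + v * glob) / (k + v))
    return theta

# Three neighbors, all class one; six training points split evenly.
theta = map_estimate([1/3] * 3, [1, 1, 1], [1/6] * 6, [1, 1, 1, 2, 2, 2], v=2, Q=2)
print(theta)
```

Setting `v=0` removes the global term and returns the weighted ML estimate, which matches the remark that Lemma 1 follows from Lemma 3 with the parameter $v$ set to zero.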


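Proof of Corollary 1. The following are equivalent:

\begin{align}
& \arg\min_{g} E_{\Theta}\left[ \sum_{h=1}^{Q} C(g, h)\, \Theta_h \right] \tag{38} \\
={} & \arg\min_{g} \sum_{h=1}^{Q} C(g, h)\, E_{\Theta}[\Theta_h] \tag{39} \\
={} & \arg\min_{g} \sum_{h=1}^{Q} C(g, h)\, \hat{\theta}_h, \tag{40}
\end{align}

where (39) follows from (38) by the linearity of expectation, and (40) follows from (39) because the MER estimate $\hat{\theta}_h$ as stated in Theorem 1, Equation (8), is equal to $E_{\Theta}[\Theta_h]$ as shown in the proof of Theorem 1 (32).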
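The final expression is a plain argmin over classes and can be sketched as below. The cost matrix layout (`cost[g-1][h-1]` holding $C(g, h)$, the cost of deciding $g$ when the truth is $h$) and the function name are assumptions made for this illustration.

```python
def min_expected_cost_class(cost, theta):
    """Return the class g (1-based) minimizing sum_h C(g, h) * theta_h.

    cost[g-1][h-1] is C(g, h), assumed to be the cost of deciding g when
    the true class is h; theta is a class pmf estimate such as the MER one.
    """
    expected = [sum(c * th for c, th in zip(row, theta)) for row in cost]
    return 1 + expected.index(min(expected))

cost = [[0.0, 0.1],    # C(1,1), C(1,2)
        [0.9, 0.0]]    # C(2,1), C(2,2)
theta_mer = [1 / 3, 2 / 3]   # 1NN MER estimate when the neighbor is class two
print(min_expected_cost_class(cost, theta_mer))
```

Here class one is chosen even though the single neighbor is class two, because deciding class two wrongly is nine times as costly; with symmetric costs the same estimate leads to class two.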

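Proof of Lemma 4. First we show that $\hat{\theta}_{1,\mathrm{ML}} > \frac{1}{2}$ implies $\hat{\theta}_{1,\mathrm{MER}} > \frac{1}{2}$. Rewriting $\hat{\theta}_{1,\mathrm{MER}}$ in terms of $\hat{\theta}_{1,\mathrm{ML}}$,

$$\hat{\theta}_{1,\mathrm{MER}} = \frac{k\, \hat{\theta}_{1,\mathrm{ML}} + 2\left(\frac{1}{2}\right)}{k + 2} > \frac{k\left(\frac{1}{2}\right) + 2\left(\frac{1}{2}\right)}{k + 2} = \frac{(k + 2)\left(\frac{1}{2}\right)}{k + 2} = \frac{1}{2}.$$

To prove the reverse, assume that $\hat{\theta}_{1,\mathrm{MER}} > 1/2$,

$$\hat{\theta}_{1,\mathrm{MER}} = \frac{k\, \hat{\theta}_{1,\mathrm{ML}} + 2\left(\frac{1}{2}\right)}{k + 2} > \frac{1}{2}.$$

Cross-multiplying and solving for $\hat{\theta}_{1,\mathrm{ML}}$,

$$k\, \hat{\theta}_{1,\mathrm{ML}} + 2\left(\frac{1}{2}\right) > (k + 2)\, \frac{1}{2} = \frac{k}{2} + 1,$$

$$\hat{\theta}_{1,\mathrm{ML}} > \frac{1}{2}.$$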